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"The lack of real contact between mathematics and biology is either a tragedy, a 
scandal or a challenge, it is hard to decide which." - Gian-Carlo Rota, 34, p. 2] 

1 Introduction 

The grand challenges in biology today are being shaped by powerful high- 
throughput technologies that have revealed the genomes of many organisms, 
global expression patterns of genes and detailed information about variation 
within populations. We are therefore able to ask, for the first time, funda- 
mental questions about the evolution of genomes, the structure of genes and 
their regulation, and the connections between genotypes and phenotypes of 
individuals. The answers to these questions are all predicated on progress in 
a variety of computational, statistical, and mathematical fields |35j . 

The rapid growth in the characterization of genomes has led to the ad- 
vancement of a new discipline called Phylogenomics. This discipline, whose 
scope and potential was first outlined in [22] . results from the combination 
of two major fields in the life sciences: Genomics, i.e., the study of the 
function and structure of genes and genomes; and Molecular Phylogenetics, 
i.e., the study of the hierarchical evolutionary relationships among organisms 
and their genomes. The objective of this article is to offer mathematicians a 
first introduction to this emerging field, and to discuss specific problems and 
developments arising from phylogenomics. 

The mathematical tools to be highlighted in this paper are statistics, 
probability, combinatorics and - last but not least - algebraic geometry. 
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Emphasis is placed on the use of Algebraic Statistics, which is the study of 
statistical models for discrete data using algebraic methods. See jU, §1] 
for details. Several models which are relevant for phylogenomics are shown 
to be algebraic varieties in certain high- dimensional spaces of probability 
distributions. This interplay between statistics and algebraic geometry offers 
a conceptual framework for (understanding existing and developing new) 
combinatorial algorithms for biological sequence analysis. It is our hope that 
this will contribute to some "real contact" between mathematics and biology. 

This paper is organized as follows. In Section 2 we begin by reviewing 
the organization and structure of genomes. This section is meant as a brief 
tutorial, aimed at readers who have a little or no background in molecular 
biology. It offers definitions of the relevant biological terminology. 

Section 3 describes a very simple example of a statistical model for in- 
ferring information about the genetic code. The point of this example is to 
explain the philosophy of algebraic statistics: model means algebraic variety. 

A more realistic model, which is widely used in computational biology, 
is the hidden Markov model. In Section 4 we explain this model and discuss 
its applications to the problem of identifying genes in a genome. Another 
key problem is the alignment of biological sequences. Section 5 reviews the 
statistical models and combinatorial algorithms for sequence alignment. We 
also discuss the relevance of parametric inference |32j in this context. 

In Section 6 we present statistical models for the evolution of biological 
sequences. These models are algebraic varieties associated with phylogenetic 
trees, and they play a key role in inferring the ancestral relationships among 
organisms and in identifying regions in genomes that are under selection. 

Section 7 gives an introduction to the field of Phylogenetic Combinatorics, 
which is concerned with the combinatorics and geometry of finite metric 
spaces, and their application to data analysis in the life sciences. We shall 
discuss the space of all trees [5], the neighbor-joining algorithm for projecting 
metrics onto this space, and several natural generalizations of these concepts. 

In Section 8 we go back to the data. We explain how one obtains and 
studies DNA sequences generated by genome sequencing centers, and we 
illustrate the mathematical models by estimating the probability that the 
DNA sequence in Conjecture ^occurred by chance in ten vertebrate genomes. 
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2 The Genome 



Every living organism has a genome, made up of deoxyribonucleic acids 
(DNA) arranged in a double helix [SH], which encodes (in a way to be 
made precise) the fundamental ingredients of life. Organisms are divided into 
two major classes: eukaryotes (organisms whose cells contain a nucleus) and 
prokaryotes (for example bacteria). In our discussion we focus on genomes 
of eukaryotes, and, in particular, the human genome [3"%1 loTj. 

Eukaryotic genomes are divided into chromosomes. The human genome 
has two copies of each chromosome. There are 23 pairs of chromosomes: 22 
autosomes (two copies each in both men and women) and two sex chromo- 
somes, which are denoted X and Y. Women have two X chromosomes, while 
men have one X and one Y chromosome. Parents pass on a mosaic of their 
pair of chromosomes to their children. 

The sequence of DNA molecules in a genome is typically represented as a 
sequence of letters, partitioned into chromosomes, from the four letter alpha- 
bet Q = {A, C, G, T}. These letters correspond to the bases in the double 
helix, that is, the nucleotides Adenine, Cytosine, Guanine and Thymine. 
Since every base is paired with an opposite base (A with T, and C with G in 
the other half of the double helix), in order to describe a genome it suffices 
to list the bases in only one strand. However, it is important to note that the 
two strands have a directionality which is indicated by the numbers 5' and 
3' on the ends (corresponding to carbon atoms in the helix backbone). The 
convention is to represent DNA in the 5' — > 3' direction. The human genome 
consists of approximately 2.8 billion bases, and has been obtained using high 
throughput sequencing technologies that can be used to read the sequence of 
short DNA fragments hundreds of bases long. Sequence assembly algorithms 
are then used to piece together these fragments [HH]- See also |HJ §4]. 

Despite the tendency to abstract genomes as strings over the alphabet 
Q, one must not forget that they are highly structured: for example, certain 
subsequences within a genome correspond to genes. These subsequences 
play the important role of encoding proteins. Proteins are polymers made 
of twenty different types of amino acids. Within a gene, triplets of DNA, 
known as codons, encode the amino acids for the proteins. This is known 
as the genetic code. Table 1 shows the 64 possible codons, and the twenty 
amino acids they code for. Each amino acid is represented by a three letter 
identifier ("Phe" = Phenylalanine, "Leu" = Leucin, ....). The three codons 
TAA, TAG and TGA are special: instead of coding for an amino acid, they 
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T 


C 


A 


G 




TTT i-» Phe 


TCT i-> 


Ser 


TAT i-> Tyr 


TGT i-> 


Cys 


T 


TTC i-» Phe 


TCC i ^ 


Ser 


TAC i — ^ Tyr 


TGC i ^ 


Cys 


TTA i-> Leu 


TCA i — ^ 


Ser 


TAA i — ^ stop 


TGA i — ► 


stop 




TTG i-» Leu 


TCG i — ^ 


Ser 


TAG i — ► stop 


TGG i — ^ 


Trp 




CTT i-» Leu 


CCT i-» 


Pro 


CAT i-> His 


CGT i-> 


Arg 

o 


C 


CTC ' ► Leu 


CCC i ► 


Pro 


CAC ^ His 


CGC i ^ 


Arg 


CTA i-> Leu 


CCA i — ^ 


Pro 


CAA i — ^ Gin 


CGA i — ^ 


Arg 




CTG i — ► Leu 


CCG ' ^ 


Pro 


CAG i — ► Gin 


CGG i ^ 


Arg 




ATT ^ lie 


ACT i — ► 


Thr 


AAT i-> Asn 


AGT i — ^ 


Ser 




ATC ^ He 


ACC i — ► 


Thr 


AAC i-» Asn 


AGC i — ^ 


Ser 


A 


ATA i-> lie 


ACA i — ^ 


Thr 


AAA i — ^ Lys 


AGA i — ► 


Arg 




ATG i — ► Met 


ACG i — ^ 


Thr 


AAG i — ^ Lys 


AGG i — ^ 


Arg 




GTT i-> Val 


GCT i ^ 


Ala 


GAT h-> Asp 


GGT i ^ 


Gly 


G 


GTC i ^ Val 


GCC i ^ 


Ala 


GAC ' ► Asp 


GGC i ^ 


Gly 


GTA i ^ Val 


GCA i — ► 


Ala 


GAA i — ^ Glu 


GGA i — ^ 


Gly 




GTG i ^ Val 


GCG i ^ 


Ala 


GAG ' ^ Glu 


GGG ' ^ 


Gly 



Table 1: The genetic code. 



are used to indicate that the protein ends. 

In order to make protein, DNA is first copied into a similar molecule called 
messenger RNA (abbreviated mRNA) in a process called transcription. It is 
the RNA that is translated into protein. The entire process is referred to as 
expression. Proteins can be structural elements, or perform complex tasks 
(such as regulation of expression) by interacting with the many molecules and 
complexes in cells. Thus, the genome is a blueprint for life. An understanding 
of the genes, the function of their proteins, and their expression patterns is 
fundamental to biology. 

The human genome contains approximately 25, 000 genes, although the 
exact number has still not been determined. While there are experimental 
methods for validating and discovering genes, there is still no known high 
throughput technology for accurately identifying all the genes in a genome. 
The computational problem of identifying genes, the gene finding problem, 
is an active area of research. One of the main difficulties lies in the fact that 
only a small portion of any genome is genie. For instance, less than 5% of the 
human genome is known to be functional. In Section 4 we discuss this prob- 
lem, and the role of probabilistic models in formulating statistically sound 
methods for distinguishing genes from non-genic sequence. The models of 
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choice, hidden Markov models, allow for the integration of diverse biological 
information (such as the genetic code and the structure of genes) and yet are 
suitable for designing efficient algorithms. By virtue of being algebraic vari- 
eties, they provide a key example for the link between algebra, statistics and 
genomics. Nevertheless, the current understanding of genes is not sufficient 
to allow for the ab-initio identification of all the genes in a genome, and it is 
through comparison with other genomes that the genes are revealed 

The differences between the genomes of individuals in a population are 
small and are primarily due to recombination events (part of the process by 
which two copies of parental chromosomes are merged in the offspring). On 
the other hand, the genomes of different species (classes of organisms that 
can produce offspring together) tend to be much more divergent. Genome 
differences between species can be explained by many biological events in- 
cluding: 

• Genome rearrangement - comparing chromosomes of related species re- 
veals large segments that have been reversed and flipped (inversions), 
segments that have been moved (transpositions), fusions of chromo- 
somes, and other large scale events. The underlying biological mecha- 
nisms are poorly understood I48j . 

• Duplications and loss - some genomes have undergone whole genome 
duplications. This process was recently demonstrated for yeast |3T)] . 
Individual chromosomes or genes may also be duplicated. Duplication 
events are often accompanied by gene loss, as redundant genes slowly 
lose or adapt their function over time [23J. 

• Parasitic expansion - large sections of genomes are repetitive, consist- 
ing of elements which can duplicate and re-integrate into a genome. 

• Point mutation, insertion and deletion - DNA sequences mutate, and 
in non-functional regions these mutations accumulate over time. Such 
regions are also likely to exhibit deletions; for example, strand slippage 
during replication can lead to an incorrect copy number for repeated 
bases. 

Accurate mathematical models for sequence alignment and evolution, our 
topics in Sections 5-7, have to take these processes into consideration. 
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Two distinct DNA bases that share a common ancestor are called ho- 
mologous. Homologous bases can be related via speciation and duplication 
events, and are therefore divided into two classes: orthologous and paralo- 
gous. Orthologous bases are descendant from a single base in an ancestral 
genome that underwent a speciation event, whereas two paralogous bases cor- 
respond to two distinct bases in a single ancestral genome that are related 
via a duplication. Because we cannot sequence ancestral genomes, it is never 
possible to formally prove that two DNA bases are homologous. However, 
statistical arguments can show that it is extremely likely that two bases are 
homologous, or even orthologous. The problem of identifying homologous 
bases between genomes of related species is known as the alignment problem. 
We shall discuss this in Section 5. 

The alignment of genomes is the first step in identifying highly conserved 
sequences that point to the small fraction of the genome that is under selec- 
tion, and therefore likely to be functional. Although the problem of sequence 
alignment is mathematically and computationally challenging, proposed ho- 
mologous sequences can be rapidly and independently validated (it is easy to 
check whether two sequences align once they have been identified), and the 
regions can often be tested in a molecular biology laboratory to determine 
their function. In other words, sequence alignment reveals concrete verifiable 
evidence for evolutionary selection and often results in testable hypotheses. 

As a focal point for our discussion, we present a specific DNA sequence 
of length 42. This sequence was found in April 2004 as a byproduct of com- 
putational work conducted by Lior Pachter's group at Berkeley [TU]. Whole 
genome alignments were found and analyzed of the human (hs), chimpanzee 
(pt), mouse (mm), rat (rn), dog (cf), chicken (gg), frog (xt), zebra-fish (dr), 
fugu-fish (tr) and tetraodon (tn) genomes. The abbreviations refer to the 
Latin names of these organisms. They will be used in Table 3 and Figure EJ 
From alignments of the ten genomes, the following hypothesis was derived, 
which we state in the form of a mathematical conjecture. 

Conjecture 1. (The "Meaning of Life") The sequence of 42 bases 

TTTAATTGAAAGAAGTTAATTGAATGAAAATGATCAACTAAG ( 1 ) 

was present in the genome of the ancestor of all vertebrates, and it has been 
completely conserved to the present time (i.e., none of the bases have been 
mutated, nor have there been any insertions or deletions). 
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The identification of such a sequence requires a highly non-trivial compu- 
tation: the alignment of ten genomes (including mammalian genomes close 
to 3 billion bases in length) and subsequent analysis to identify conserved 
orthologous regions within the alignment [60J. Using the tools described in 
Section 8, one checks that the sequence (0) is present in all ten genomes. For 
instance, in the human genome (May 2004 version), the sequence occurs on 
chromosome 7 in positions 156501197-156501238. By examining the align- 
ment, one verifies that, with very high probability, the regions containing this 
sequence in all ten genomes are orthologous. Furthermore, the implied claim 
that (JH) occurs in all present-day vertebrates can, in principle, be tested. 

Identifying and analyzing sequences such as (JTJ) is important because they 
are highly conserved yet often non-genic [7] . One of the ongoing mysteries in 
biology is to unravel the function of the parts of the genome that is non-genic 
and yet very conserved. The extent of conservation points to the possibility 
of critical functions within the genome. It may be a coincidence that the 
segment above contains two copies of the motif TTAATTGAA, but this motif 
may also have some function (for example, it may be bound by a protein). 
Indeed, the identification of such elements is the first step towards under- 
standing the complex regulatory code of the genome. Back in 2003, we were 
amused to find that 42 was the length of the longest such sequence. In light 
of [1 , it was decided to name this DNA-sequence "The Meaning of Life" . 

The conjecture was formulated in the spring of 2004 and it was circulated 
in the first arXiv version of this paper. In the fall of 2004, Drton, Eriksson 
and Leung [21 J conducted a new study based on improved alignments. Their 
work led to even longer sequences with similar properties. So, the Meaning 
of Life sequence no longer holds the record in terms of length. However, since 
Conjecture ^ has been inspiration for our group, and it still remains open 
today, we decided to stick with this example. It needs to be emphasized 
that disproving Conjecture ^ would not invalidate any of the methodology 
presented in this article. For a biological perspective we refer to [2*T] . 

3 Codons 

Because of the genetic code, the set Q 3 of all three-letter words over the 
alphabet Q = {A, C, G, T} plays a special role in molecular biology. As 
was discussed in Section 2, these words are called codons, with each triplet 
coding for one of 20 amino acids (Table 1). The map from 64 codons to 
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20 amino acids is not injective, and so multiple codons code for the same 
amino acid. Such codons are called synonymous. Eight amino acids have 
the property that the synonymous codons that code for them all agree in the 
first two positions. The third positions of such codons are called four-fold 
degenerate. The translation of a series of codons in a gene (typically a few 
hundred) results in a three-dimensional folded protein. 

A model for codons is a statistical model whose state space is the 64- 
element set Q 3 . Selecting a model means specifying a family of probability 
distributions p = (pijk) on Q 3 . Each probability distribution p is a 4 x 4 x 
4-table of non-negative real numbers which sum to one. Geometrically, a 
distribution on codons is a point p in the 63-dimensional probability simplex 

A 63 = { p G R n3 : Pvk = 1 and Puk > for all UK G Q 3 }. 

ijk en 3 

A model for codons is hence nothing but a subset M. of the simplex Ag3. 
Statistically meaningful models are usually given in parametric form. If the 
number of parameters is d, then there is a set V C M. d of allowed parameters, 
and the model M. is the image of a map </> from V into A 63 . We illustrate 
this statistical point of view by means of a very simple independence model. 

Models for codons have played a prominent role in the work of Samuel 
Karlin, who was one of the mathematical pioneers in this field. One instance 
of this is the genome signature in ^3]. We refer to [HJ Example 4.3] for a 
discussion of this model and more recent work on codon usage in genomes. 

Consider a DNA sequence of length 3m which has been grouped into 
m consecutive codons. Let uuk denote the number of occurrences of a 
particular codon UK. Then our data is the 4 x 4 x 4-table u = (ujjk)- The 
entries of this table are non-negative integers, and if we divide each entry by 
m then we get a new table — • u which is a point in the probability simplex 
A63- This table is the empirical distribution of codons in the given sequence. 

Let Ai be the statistical model which stipulates that, for the sequence 
under consideration, the first two positions in a codon are independent from 
the third position. We may wish to test whether this independence model fits 
our data u. This question makes sense in molecular biology because many of 
the amino acids are uniquely specified by the first two positions in any codon 
which represents that particular amino acid (see Table 1). Therefore, third 
positions in synonymous codons tend to be independent of the first two. 

Our independence model M. has 18 free parameters. The set of allowed 
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parameters is an 18-dimensional convex polytope, namely, it is the product 

V = A 15 x A 3 . 

Here A 15 is the 15-dimensional simplex consisting of probability distribu- 
tions a = (oiu) on Q 2 , and A3 is the tetrahedron consisting of probability 
distributions (3 = (/3k) on Q. Our model A4 is parameterized by the map 

4> : V -> A 63 , 4>((a,f3)) IJK = au-(3 K - 

Hence M. = image (0) is an 18-dimensional algebraic subset inside the 63- 
dimensional simplex. To test whether a given 4 x 4 x 4-table p lies in A4, we 
write that table as a two-dimensional matrix with 16 rows and 4 columns: 



/ Paaa 


Paac 


Paag 


Paat\ 


Paca 


Pacc 


Pacg 


Pact 


Paga 


Pagc 


Pagg 


Pact 


Pata 


Patc 


Patg 


Patt 


PCAA 


PCAC 


PCAG 


Pcat 


\Ptta 


Pttc 


Pttg 


Pttt / 



Linear algebra furnishes the following characterizations of our model: 
Proposition 2. For a point p G A 63; the following conditions are equivalent: 

1. The distribution p lies in the model M.. 

2. The 16 x 4 matrix p' has rank one. 

3. All 2 x 2-minors of the matrix p' are zero. 

4- Pijk ■ Plain = Pun • Plmk for all nucleotides I, J, K, L,M,N. 

In the language of algebraic geometry, the model M. is known as the Segre 
variety. More precisely, M. is the set of non-negative real points on the Segre 
embedding of P 15 x P 3 in P 63 . Here and throughout, the symbol P m denotes 
the complex projective space of dimension m. One of the points argued 
in this paper is that many of the more advanced statistical models, such 
as graphical models [HI §1.5], actually used in practice by computational 
biologists are also algebraic varieties with a special combinatorial structure. 
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Returning to our original biological motivation, we are faced with the 
following statistics problem. The DNA sequence under consideration is sum- 
marized in the data u, and we wish to test whether or not the model Ai fits 
the data. The geometric idea of such a test is to determine whether or not 
the empirical distribution — ■ u lies close to the Segre variety M.. Statisti- 
cians have devised a wide range of such tests, each representing a statistically 
meaningful notion of "proximity to M. n . These include the x 2 -test, the G 2 - 
test, Fisher's exact test, and others, as explained in standard statistics texts 
such as jH] or j2HI • A useful tool of numerical linear algebra for measuring the 
distance of a point to the Segre variety is the singular value decomposition of 
the matrix p' . Indeed, p' lies on M. if and only if the second singular value of 
p' is zero. Singular values provide a good notion of distance between a given 
matrix and various determinantal varieties such as M.. 

One key ingredient in statistical tests is maximum likelihood estimation. 
The basic idea is to find those model parameters ajj and [3k which would 
best explain the observed data. If we consider all possible genome sequences 
of length 3m, then the likelihood of observing our particular data u equals 

7- n p u ijk, 

where 7 is a combinatorial constant. This expression is a function of (a, (3), 
called the likelihood function. We wish to find the point in our parameter 
domain V = A 15 x A 3 which maximizes this function. The solution (a, (3) 
to this non-linear optimization problem is said to be the maximum likelihood 
estimate for the data u. In our independence model, the likelihood function 
is convex, and it is easy to write down the global maximum explicitly: 

&u = — y~] uijk and (3 K = — V] u IJK . 
Ken /Jen 2 

In general, the likelihood function of a statistical model will not be convex, 
and there is no easy formula for writing the maximum likelihood estimate as 
a function of the data. In practice, numerical hill-climbing methods are used 
to solve this optimization problem, but, of course, there is no guarantee that 
a local maximum found by such methods is actually the global maximum. 
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4 Gene Finding 



In order to find genes in DNA sequences, it is necessary to identify struc- 
tural features and sequence characteristics that distinguish genie sequence 
from non-genic sequence. We begin by describing more of the detail of gene 
structure which is essential in developing probabilistic models. 

Genes are not contiguous subsequences of the genome, but rather split 
into pieces called introns and exons. After transcription, introns are spliced 
out and only the remaining exons are used in translation (Figure 1). Not 
all of the sequence in the exons is translated; the initial and terminal exons 
may consist of untranslated regions (indicated in grey in the figure). Since 
the genetic code is in (non-overlapping) triplets, it follows that the lengths 
of the translated portions of the exons must sum to mod 3. In addition to 




Figure 1: Structure of a gene. 

the exon-intron structure of genes, there are known sequence signals. The 
codon ATG initiates translation, and thus is the first codon following the 
untranslated portion of the initial exons. The final codon in a gene must 
be one of TAG,TAA or TGA, as indicted in Table 1. These codons signal 
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the translation machinery to stop. There are also sequence signals at the 
intron-exon boundaries: GT at the 5' end of an intron and AG at the 3' end. 

A hidden Markov model (HMM) is a probabilistic model that allows for 
simultaneous modeling of the bases in a DNA sequence of length n and the 
structural features associated with that sequence. The HMM consists of n 
observed random variables Yi, . . . , Y n taking on / possible states, and n hidden 
random variables X 1; . . . ,X n taking on k possible states. In the context of 
phylogenomics, the observed variables Yi usually have I = 4 states, namely 
Q = {A, C, G, T}. The hidden random variables Xi serve to model features 
associated with the sequence which is generated by Yi, Y 2 , . . . , Y n . A simple 
scenario is k = 2, with the set of hidden states being = { exon, intron}. 

The characteristic property of an HMM is that the distributions of the Yi 
depend on the Xi, while the Xi form a Markov chain. This is illustrated for 
n = 3 in Figure 2, where the unshaded circles represent the hidden variables 
Xi,X 2 , X 3 and the shaded circles represent the observed variables Yi, Y 2 , Y 3 . 









Figure 2: The hidden Markov model of length three. 

Computational biologists use HMMs to annotate DNA sequences. The 
basic idea is this: it is postulated that the bases are instances of the random 
variables Y%, . . . ,Y n , and the problem is to identify the most likely assign- 
ments of states to Xi, . . . , X n that could be associated with the observations. 
In gene finding, homogeneous HMMs are used. This means that all transition 
probabilities Xi —>■ X i+ i are given by the same k x /c-matrix S = (sij), and 
all the transitions Xi — > Yi are given by another k x 4-matrix T = (tij). 
Here Sjj represents the probability of transitioning from hidden state i to 
hidden state j; for instance, if k = 2 then i,j G © = {exon, intron}. The 
parameter t^ represents the probability that state i 6 9 outputs letter j 6 Q. 



12 



In practice, the parameters and range over real numbers satisfying 
Sij, tij > and J^sy = = 1. (2) 

However, just like in our discussion of the Segre variety in Section 3, we may 
relax the requirements (j2J and allow the parameters to be arbitrary complex 
numbers. This leads to the following algebraic representation [321 §2]. 

Proposition 3. The homogeneous HMM is the image of a map <p '■ C k( - k+ ^ — > 
C 1 ", where each coordinate of (f) is a bi-homogeneous polynomial of degree n — 1 
in the transition probabilities s^ and degree n in the output probabilities ty. 

The coordinate <p a of the map <fi indexed by a particular DNA sequence 
a G Q n represents the probability that the HMM generates the sequence a. 
The following explicit formula for that probability establishes Proposition El 

0<j — ^ ] then f s i\i?Piioi f ^ 1 g »2»3^3Q"3 ( ^ ] s i3t4^i4C4 ( ' '' (^) 

The expansion of this polynomial has k n terms 

ti 1 a- 1 Si 1 i 2 ti 2 a2 S i2h^hcF3 ' ' ' S i n -lin^in^ n - (4) 

For any fixed parameters as in (J2J), one wishes to determine a string i = 
(it, %ii ■ ■ ■ , i n ) £ @ n which indexes a term of largest numerical value 
among all fc™ terms of (If there is more than one string with maximum 
value then we break ties lexicographically). We call i the explanation of the 
observation a. In our example {k = 2,1 = 4), the explanation i of a DNA 
sequence a is an element of G n = { exon, intron} n . It reveals the crucial 
information of Figure 1, namely, the location of the exons and introns. In 
summary, the DNA sequence to be annotated by an HMM corresponds to 
the observation o G Q n , and the explanation i is the gene prediction. Thus 
gene finding means nothing but computing the output i from the input cr. 

In real-world applications, the integer n may be quite large. It is not 
uncommon to annotate DNA sequences of length n > 1, 000, 000. The size k n 
of the search space for finding the explanation is enormous (exponential in n) . 
Fortunately, the recursive decomposition in Q, reminiscent of Horner's Rule, 
allows us to evaluate a multivariate polynomial with exponentially many 
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terms in linear time (in n). In other words, for given numerical parameters 
Sij and tij, we can compute the probability (f) a {sij, Uj) quite efficiently. 

Similarly, the explanation i of an observed DNA sequence a can be com- 
puted in linear time. This is done using the Viterbi algorithm, which evaluates 

max Tj 1(T1 + 1 max S,^ j 2 +Tj 2CT2 + 1 max Si 2 i 3 +Tj 3(T3 + 1 max Si 3 j 4 +Tj 4CT4 +(■■■)))) 

ilG© \i2&& V^G© VMS© /// 

where = log(sy) and T^- = log(ty). This expression is a piecewise linear 
convex function on M. k ( k+l \ known as the tropicalization of the polynomial 
4> a - Indeed, evaluating this expression requires exactly the same operations 
as evaluating <p a , with the only difference that we are replacing ordinary 
arithmetic by the tropical semiring. The tropical semiring (also known as 
the max-plus algebra) consists of the real numbers K together with an extra 
element oo, where the arithmetic operations of addition and multiplication 
are redefined to be max (or equivalently min) and plus respectively. The 
tropical semiring and its use in dynamic programming optimizations is ex- 
plained in gU §2.1]. 

Every choice of parameters (sij,tij) specifies a gene finding function 

Q n -> 9 n , ff Hi 

which takes a sequence a to its explanation i. The number of all functions 
from Q n to O n equals 2 n ' 4 ™ and hence grows double-exponentially in n. 
However, the vast majority of these functions are not gene finding functions. 
The following remarkable complexity result was proved by Elizalde |2*4] : 

Theorem 4. The number of gene finding functions grows at most polynomi- 
al^ in the sequence length n. 

As an illustration consider the n = 3 example visualized in Figure 2. 
There are 8 64 = 6.277 • 10 57 functions {A, C, G, T} 3 — > {exon, intron} 3 
but only a tiny fraction of these are gene finding functions. (It would be 
interesting to determine the exact number). It is an open problem to give 
a combinatorial characterization of gene finding functions, and to come up 
with accurate lower and upper bounds for their number as n grows. 

For gene finding HMMs, it is always the case that I is small and fixed 
(usually, 1 — 4), and n is large. However, the size of k or structure of the 
state space for the hidden variables X, tends to vary a lot. While the k = 2 
used in our discussion of gene finding functions was meant to be just an 
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illustration, a biologically meaningful gene finding model could work with 
just three hidden states: one for introns, one for exons, and a state for 
intergenic sequence. However, in order to enforce the constraint that the 
sum of the lengths of the exons is mod 3, a more complicated hidden state 
space is necessary. Solutions to this problem were given in [T^ l3Tj. 

We conclude this section with a brief discussion of the important problem 
of estimating parameters for HMMs. Indeed, so far nothing has been said how 
the values of the parameters and are to be chosen when running the 
Viterbi algorithm. Typically, this choice involves a combination of biological 
and statistical considerations. Let us concentrate on the latter aspect. 

Recall that maximum likelihood estimation is concerned with finding pa- 
rameters for a statistical model which best explain the observed data. As 
was the case for the codon model (Section 3), the maximum likelihood es- 
timate is an algebraic function of the data. In contrast to what we did at 
the end of Section 3, it is now prohibitive to locate the global maximum in 
the polytope The expectation-maximization (EM) algorithm is a general 
technique used by statisticians to find local maxima of the likelihood func- 
tion §1.3]. For HMMs, this algorithm is also known as the Baum- Welch 
algorithm. It takes advantage of the recursive decomposition in (J3J) and it is 
fast (linear in n). The widely used book [TB] provides a good introduction 
to the use of the Baum- Welch algorithm in training HMMs for biological 
sequence applications. The connection between the EM algorithm and the 
Baum- Welch algorithm is explained in detail in [33] . In order to understand 
the performance of EM or to develop more global methods [14J, it would be 
desirable to obtain upper and lower bounds on the algebraic degree of 
the maximum likelihood estimate. 

5 Sequence Alignment 

Although tools such as the hidden Markov model are important for modeling 
and analyzing individual genome sequences, the essence of phylogenomics lies 
in the power of sequence comparison. Because functional sequences tend to 
accumulate fewer mutations over time, it is possible, by comparing genomes, 
to identify and characterize such sequences much more effectively. 

In this section we examine models for sequence evolution that allow for 
insertions, deletions and mutations in the special case of two genomes. These 
are known as pairwise sequence alignment models. The specific model to be 
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discussed here is the pair hidden Markov model. In the subsequent section 
we shall examine phylogenetic models for more than two DNA sequences. 

We have already seen two instances of statistical models that are repre- 
sented by polynomials in the model parameters (the codon model and the 
hidden Markov model). Models for pairwise sequence alignment are also 
specified by polynomials, and are in fact close relatives of hidden Markov 
models. What distinguishes the sequence alignment problem is an extra layer 
of complexity which arises from a combinatorial explosion in the number of 
possible alignments between sequences. Here we describe one of the simplest 
alignment models (for a pair of sequences), with a view towards connections 
with tree models and algebraic statistics. 

Given two sequences a 1 = o\o\ ■ • ■ cr* and a 2 = o~\o\ • ■ ■ cr^ over the 
alphabet Q = {A,C,G,T}, an alignment is a string over the auxiliary al- 
phabet {M, I, D} such that #M + #£> = n and #M + #1 = m. Here 
#M, t^J, j^D denote the number of characters M, I, D in the word respec- 
tively. An alignment records the "edit steps" from the sequence a 1 to the 
sequence a 2 , where edit operations consist of changing characters, preserving 
them, or inserting/deleting them. An I in the alignment string corresponds 
to an insertion from the first sequence to the second, a D is a deletion from 
the first sequence to the second, and an M is either a character change, or 
lack thereof. The set A n , m of all alignments depends only on the integers n 
and m, and not on a 1 and a 2 . 

Proposition 5. The cardinality of the set A n , m of all alignments can be 
computed as the coefficient of the monomial x m y n in the generating function 

= i +x + y + x 2 + s xy + y' 2 ^ \-x 5 + 9x 4 y + 25x 3 y 2 -\ 

1 — x — y — xy 

These cardinalities |^4. n)m | are known as Delannoy numbers in combina- 
torics [23 §6.3]. For instance, there are | ^4.2,3 1 — 25 alignments of two 
sequences of length two and three. They are listed in Table 2 below. 

The pair hidden Markov model is visualized graphically in Figure EH The 
hidden random variables (unshaded nodes forming the Markov chain) take 
on the values M, I, D. Depending on the state at a hidden node, either one 
or two characters are generated; in this way, pair hidden Markov models dif- 
fer from standard hidden Markov models. The squares around the observed 
states (called plates) are used to indicate that the number of characters gen- 
erated may vary depending on the hidden state. The number of characters 
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Figure 3: A pair hidden Markov model for sequence alignment. 

generated is a random variable, indicated by unshaded nodes within the 
plates (called class nodes). In pair hidden Markov models, the class nodes 
take on the values or 1 corresponding to whether or not a character is gen- 
erated. Pair hidden Markov models are therefore HMMs, where the structure 
of the model depends on the assignments to the hidden states. The graphical 
model structure of pair HMMs is explained in more detail in [2]. 

The next proposition gives the algebraic representation of the pair hidden 
Markov model. For a given alignment a e A n , m , we denote the jth character 
in a by aj, we write a[i] for #M + j^D in the prefix a^a^ ■ ■ - di, and we 
write a(j) for j^M + #1 in the prefix a\a>2 ■ ■ ■ a,j. Let a 1 and a 2 be two 
DNA sequences of lengths n, m respectively. Then the probability that our 
model generates these two sequences equals 



E 



«1 



v°a[l]> °a(l>/ 



n 



(5) 



where the parameter s ai _ iai is the transition probability from state aj_i to 
di, and the parameter ^(aV,, °"a(i>) ^ s ^ e ou tput probability for a given 
state cii and the indicated output characters on the strings a 1 and a 2 . 
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Proposition 6. The pair hidden Markov model for sequence alignment is 
the image of a polynomial map : C 33 — > C 4 " +m . The coordinates of the 
map <p are the polynomials of degree < 2n + 2m — 1 which are given in J3J). 

We need to explain why the number of parameters in our representation 
of the pair hidden Markov model is 33. First, there are nine parameters 

(smm smi smd 
Sim s n sid 
sdm sdi sdd 

which play the same role as in Section 4, namely, they represent transition 
probabilities in the Markov chain. There are 16 parameters £m(oj b) =: tMab 
for the probability that letter a in a 1 is matched with letter b in a 2 . The 
insertion parameters ti(a,b) depend only on the letter b, and the deletion 
parameters to (a, b) depend only on the letter a, so there are only 8 of these 
parameters. Hence the total number of (complex) parameters is 9 + 16 + 8 = 
33. Of course, in our applications, probabilities are non-negative reals that 
sum to one, so we get a reduction in the number of parameters, just like in 
(J2J). In the upcoming example, which explains the algebraic representation 
of Proposition El we use the abbreviations £/& and tr> a for these parameters. 

Consider two sequences a 1 = ij and a 2 = klm of length n = 2 and m = 3 
over the alphabet Q = {A, C, G, T}. The number of alignments is | ^4-2,3 1 = 
25, and they are listed in Table 2. For instance, the alignment MUD, here 

written ( i ■ -j , klm •), corresponds to ^ ^ in standard genomics notation. 

The polynomial (f) a i c i is the sum of the 25 monomials (of degree 9, 7, 5) 
in the rightmost column. Thus the pair hidden Markov model presented in 
Table 2 is nothing but a polynomial map 

. c 33 -> C 1024 . 

Statistics is all about making inferences. We shall now explain how this 
is done with this model. For any fixed parameters s.. and t.. , one wishes 
to determine the alignment a e A n . m which indexes the term of largest 
numerical value among the Delannoy many terms of the polynomial <j) ffl ^. 
(If there is more than one alignment with maximum value then we break ties 
lexicographically). We call a the explanation of the observation (a 1 , a 2 ). 

The explanation for a pair of DNA sequences can be computed in polyno- 
mial time (in their lengths n and m) using a variant of the Viterbi algorithm. 
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IIIDD 


( • • -ij , klm ■ ■ ) 


t TL-S t it ti S t it Tm S t nt m S nnt n-i 

J h, ~^ L L w L I ~^ L L w 1 iil~^L I J I J 1 1 J 1 ' U J 


IIDID 


( • • i • j , kl ■ m- ) 


t ThS T 7i Tl S T nt n»' S DTi Tm S T nt D-i' 


IIDDI 


( • • ij • , kl • •m) 


t TbS t it Ti S t nt niS n nt n-iS nit r m 

1 Fu 1 1 1 t 1 1 J 1 J I 1 J 1 ' U J 1 J 1 1 ill 


IDIID 


( • i ■ • 7 , k ■ lm- ) 


t ThS t nt m S nit ti S t it Tm S Tntnj 


IDIDI 


(■ i ■ j- , k ■ I ■ m) 


t ThS T nt n; S D Tt Tl S T nt n, S n Tt Tm 


IDDII 


( ■ ij ■ ■ , • -/m ) 


t ThS t nt m S n nt n^i S n Tt ti S t Tt Tm 


DIIID 


( i • ■ ■ i , ■ klm- ) 


t n* SnTtThSTTtTlSTTt Tvr> S T D t D-i' 
v LJ I v 1 K 1 1 It 1 1 1 Til I U UJ 


DIIDI 


(i • - j- , ■ kl ■ m) 


t DiS Dltlk^IltjiSj DtDjS Dltlm 


DIDII 


(i ■ j ■ ■ , ■ k ■ lm ) 


tniSnTtjhSTntnjSnjtT]STTtT m 


DDIII 


( ij • • • , ■ ■ klm ) 


t mS nnt m s nrt ThS t rt Ti s t rt Tm 

LJ i LJ LJ -LJ J LJ 1 1 FX 1 I 1 L 1 1 1 lit 


MUD 


(i • - j , klm ■ ) 


t MihS M Tt Tl S T Tt Tm S T nt Di 

1V1 In, 1V1 1 1 L 1 1 1 II L 1 LJ L/j 


MIDI 


(i • j- , kl • to) 


tMikSMltllSiDtDjSDltlm 


MDII 


(ij ■ ■ , k • lm) 


tMikSMDtDj SDltllS ntj m 


IMID 


(■i • j , klm- ) 


tlkSlMtMilSMltlmSlDtDj 


IMDI 


(•ij ■ , kl ■ m) 


tlkSlMtMilSMDtDjSDltlm 


HMD 


(• • ij , klm ■ ) 


tlkSlltnSiMtMimSMDtDj 


IIDM 


(• • ij , kl ■ m) 


tlk s IltllSiDtDiSDMtMjm 


IDMI 


(•ij- , k • lm ) 


tlk s IDtDi s DMtMjl s Mltlm 


IDIM 


(•i ■ j , k ■ lm) 


tlkSlDtDiSDltllSiMtMjm 


DMII 


(ij ■ ■ , • klm ) 


tDiSDMtMjkSMltllSntim 


DIMI 


(i ■ j- , ■ klm ) 


tDiSDltlkSlMtMjlSMltlm 


DIIM 


(i ■ -j , ■ klm ) 


tDiSDltlkSlltllSiMtMjm 


MMI 


(ij ■ , fcZm ) 


tMikSMMtMjlSMltlm 


MIM 


( i • j , fcZm ) 


tMikSMltllSiMtMjm 


IMM 


( • , /c/m ) 


tlkSlMtMilSMMtMjm 



Table 2: Alignments for a pair of sequences of length 2 and 3. 
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Just like in the previous section, the key idea is to tropicalize the coordinate 
polynomials (j5J) of the statistical model in question. Namely, we compute 

M 

max T ai (al {1] , a 2 a{1) ) + S ^ia, + T a . (a l a[i] , a 2 a{i) ) , (6) 

i=2 

where S.. = log(s..) and T.. = log(£..). The "arg max" of this piecewise 
linear convex function is the optimal alignment a. Inference in the pair HMM 
means computing the optimal alignment of two observed DNA sequences. In 
other words, by inference we mean evaluating the alignment function 

Q n x tt m -f A n , m , {a 1 , a 2 ) ^ a. 

There are doubly-exponentially many functions from Q n x Q m to A n , m , 
but, by Elizalde's Few Inference Functions Theorem at most polyno- 
mially many of them are alignment functions. Like for gene finding functions 
(cf. Theorem it is an open problem to characterize alignment functions. 

The function IR 33 — ► K given in (jUJ) is the support function of a convex 
polytope in M 33 , namely, the Newton polytope of the polynomial (pa 1 , a 2 - The 
vertices of this polytope correspond to all optimal alignments of the sequences 
a 1 , a 2 , with respect to all possible choices of the parameters, and the normal 
fan of the polytope divides the logarithmic parameter space into regions 
which yield the same optimal alignment. This can be used for analyzing the 
sensitivity of alignments to parameters, and for the computation of posterior 
probabilities of optimal alignments. The process of computing this polytope 
is called parametric alignment or parametric inference. It is known |271I%31 Iq"%] 
that parametric inference can be done in polynomial time (in m and n). 

An important remark is that the formulation of sequence alignment with 
pair Hidden Markov models is equivalent to combinatorial "scoring schemes" 
or "generalized edit distances" which can be used to assign weights to align- 
ments [TT]. The simplest scoring scheme consists of two parameters: a mis- 
match score mis, and an indel score gap j2H]- The weight of an alignment 
is the sum of the scores for all positions in the alignment, where a match 
gets a score of 1. In the case where mis and gap are non- negative, this is 
equivalent to specializing the 33 logarithmic parameters S.. = log(s..) and 
T.. = log(£..) of the pair hidden Markov model as follows: 

Sij = 0, Tjj = Tm = —gap for all 
T M ij = -1 if i = j, and T Mij = -mis ifi^j. 
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The case where the scoring scheme consists of both positive and negative pa- 
rameters corresponds to a normalized pair hidden Markov model [TH]- This 
specialization of the parameters corresponds to projecting the Newton poly- 
tope of 00-1,0-2 into two dimensions. Parametric alignment means computing 
the resulting two-dimensional polygon. For two sequences of length n, an 
upper bound on the number of vertices in the polygon is 0(n 2 / 3 ). We have 
observed that for biological sequences the number may be much smaller. See 
[2*Tj for a survey from the perspective of computational geometry. 

In the strict technical sense, our polynomial formulation (jSJ) is not needed 
to derive or analyze combinatorial algorithms for sequence alignment. How- 
ever, the translation from algebraic geometry (J5J) to discrete optimization 
(jHJ) offers much more than just esthetically pleasing formulas. We posit that 
(tropical) algebraic geometry is a conceptual framework for developing new 
models and designing new algorithms of practical value for phylogenomics. 

6 Models of Evolution 

Because organisms from different species cannot produce offspring together, 
mutations and genome changes that occur within a species are independent 
of those occurring in another species. There are some exceptions to this 
statement, such as the known phenomenon of horizontal transfer in bacteria 
which results in the transfer of genetic material between different species, 
however we ignore such scenarios in this discussion. We can therefore repre- 
sent the evolution of species (or phyla) via a tree structure. The study of tree 
structures in genome evolution is referred to as phylogenetics. A phylogenetic 
X-tree is a tree T with all internal vertices of degree at least 3, and with the 
leaves labeled by a set X which consists of different species. In this section, 
we assume that T is known and that vertices in T correspond to known spe- 
ciation events. We begin by describing statistical models of evolution that 
are used to identify regions between genomes that are under selection. 

Evolutionary models attempt to capture three important aspects of evolv- 
ing sequences: branch length, substitution and mutation. Consider a single 
ancestral base b at the root r of a phylogenetic tree T, and assume that there 
are no insertions or deletions over time. Since the ancestral base changes, 
it is possible that at two leaves x,y G X we observe bases c\ ^ C2- We 
say that there has been a substitution between x and y. In a probabilistic 
model of evolution, we would like to capture the possibility for change along 
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internal edges of the tree, with the possibility of back substitutions as well. 
For example, it is possible that b — > c\ — > b — > ci along the path from r to x. 

Definition 7. A rate matrix (or Q-matrix) is a square matrix Q = (%)jjen 
(with rows and columns indexed by the nucleotides) satisfying the properties 

q i:j > for i ^ j, 
= for all i 6 fl, 

g«i < for all i G H. 

Rate matrices capture the notion of instantaneous rate of mutation. From 
a given rate matrix Q one computes the substitution matrices P(t) by expo- 
nentiation. The entry of P(t) in row b and column c equals the probability 
that the substitution b • • • — > c occurs in a time interval of length t We 
recall the following well-known result about continuous-time Markov models. 

Proposition 8. Let Q be any rate matrix and P(t) = e Qt = TiQ^ '• 
Then 

1. P(s + t) = P(s) + P(t), 

2. P(t) is the unique solution to P'{t) = P(t) ■ Q, P(0) = 1 for t>0, 

3. P(t) is the unique solution to P'{t) = Q ■ P(t), P(0) = 1 for t > 0. 

Furthermore, a matrix Q is a rate matrix if and only if the matrix P(t) = e Qt 
is a stochastic matrix (nonnegative with row sums equal to one) for every t. 

The simplest model is the Jukes- Cantor DNA model, whose rate matrix 
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where a > is a parameter. The corresponding substitution matrix equals 
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The expected number of substitutions over time t is the quantity 

3at = -- -trace(Q) -t = -- • logdet(P(t)). (7) 

This number is called the branch length. It can be computed from the substi- 
tution matrix P(t) and is used to weight the edges in a phylogenetic X-tree. 

One way to specify an evolutionary model is to give a phylogenetic X-tree 
T together with a rate matrix Q and an initial distribution for the root of T 
(which we here assume to be the stationary distribution on Q). The branch 
lengths of the edges are unknown parameters, and the objective is to estimate 
these branch lengths from data. Thus if the tree T has r edges, then such a 
model has r free parameters, and, according to the philosophy of algebraic 
statistics, we would like to regard it as an r-dimensional algebraic variety. 

Such an algebraic representation does indeed exist. We shall explain it for 
the Jukes-Cantor DNA model on an X-tree T. Suppose that T has r edges 
and |X| = n leaves. Let P(t) denote the substitution matrix associated with 
the i-th edge of the tree. We write ?>a{ti = — ^logdet(Pj(t)) for the branch 
length of the i-th edge, and we set 7Tj = ^(1 — e~ 4aA ) and 9,i = 1 — 3^. Thus 
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In algebraic geometry, we would regard 6{ and tt^ as the homogeneous coor- 
dinates of a (complex) projective line P 1 , but in phylogenomics we limit our 
attention to the real segment specified by 0^ > 0, 7Tj > and 0^ + 37rj = 1. 

Let A 4 n_ x denote the set of all probability distributions on Vt n . Since Q n 
has 4™ elements, namely the DNA sequences of length n, the set A 4 n_ 1 is a 
simplex of dimension 4 n — 1. We identify the j-th leaf of our tree T with the 
j-th coordinate of a DNA sequence (ui, . . . ,u n ) G Q n , and we introduce an 
unknown p U iu 2 -u n to represent the probability of observing the nucleotides 
Ui, u 2 , . . . , u n at the leaves 1,2, ... ,n. The 4 n quantities p ui „ 2 ...„ n are the 
coordinate functions on the simplex A 4 n_ l5 or, in the setting of algebraic 
geometry, on the projective space P 4 " -1 obtained by complexifying A 4 n_ 1 . 

Proposition 9. In the Jukes-Cantor model on a tree T with r edges, the 
probability p MlM2 ... Mn of making the observation (u±, u 2 , ■ ■ ■ , u n ) G f2 n at the 
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leaves is expressed as a multilinear polynomial of degree r in the model param- 
eters (#i,7Ti), (0 2 , 7r 2 ), . . . , (6 n ,7[ n ). Equivalently, in more geometric terms, 
the Jukes-Cantor model on T is the image of a multilinear map 

: (P*) r — > P 4 "- 1 . (8) 

The coordinates of the map are easily derived from the assumption 
that the substitution processes along different edges of T are independent. 
It turns out that the 4™ coordinates of are not all distinct. To see this, we 
work out the formulas explicitly for a very simple tree with three leaves. 

Example 10. Let n = r = 3, and let T be the tree with three leaves, labeled 
by X = {1, 2, 3}, directly branching off the root of T. We consider the Jukes- 
Cantor DNA model with uniform root distribution on T. This model is a 
three-dimensional algebraic variety, given as the image of a trilinear map 

: P 1 x P 1 x P 1 -> P 63 . 

The number of states in Q 3 is 4 3 = 64 but there are only five distinct poly- 
nomials occurring among the coordinates of the map 0. Let P123 be the 
probability of observing the same letter at all three leaves, pij the probabil- 
ity of observing the same letter at the leaves i,j and a different one at the 
third leaf, and pdi S the probability of seeing three distinct letters. Then 

P123 = #10203 + 37ri7r 2 7r 3 , 

Pdis = 60i7r 2 7r 3 + 67ri0 2 7r 3 + 67ri7r 2 3 + 67ri7r 2 7r 3 , 

Pl2 = 3010 2 7T 3 + 37Ti7T 2 # 3 + 67Ti7T 2 7r 3 , 

Pl3 = 30i7T 2 3 + 37Ti# 2 7T 3 + 6^!^^, 

P23 = 37Ti0 2 3 + 30i7T 2 7T 3 + 67Ti7r 2 7r 3 . 

All 64 coordinates of are given by these five trilinear polynomials, namely, 
Paaa = Pccc = Pggg = Pttt = 
Pacg = Pact = •• • = Pgtc = 
Paac — Paat = ■ ■ ■ = Pttg — 
Pag a = Pat a = •• • = Ptgt = 
Pcaa = Ptaa = ■ ■ ■ = Pgtt = 
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This means that our Jukes-Cantor model is the image of the simplified map 



0' : P^^xP 1 -> P 4 , ((e 1 ,7r 1 ),(e 2 ,7r 2 ),(e 3 ,7T 3 )) ^ (p 123 ,p dis , Pl2 , Pl3 ,p 23 ). 

In order to characterize the image of 0' algebraically, we perform the following 
linear change of coordinates: 

Qlll = Pl23 + \Vdis - \P\2 ~ \P\Z ~ |^23 = (01 - 7Ti)(0 2 - ^2) (03 ~ 7T 3 ) 

gilO = P123 - + Pl2 - §Pl3 - |^23 = (01 ~ 7Ti)(0 a ~ 7r 2 )(0 3 + 3tT 3 ) 

?101 = Pl23 - \Pdis ~ \Pl2 + Pl3 - |^23 = (01 - 7Ti)(0 2 + 3tT 2 )(6> 3 - 7T 3 ) 

gOll = Pl23 - \Pdis ~ \PV2 ~ \Pl2, + P23 = (01 + 37Ti)(0 2 - 7r 2 )(0 3 ~ 7T 3 ) 

?000 = Pl23 + Pdis + Pl2 + Pl3 + P23 = (01 + 37T 1 )(6» 2 + 37T 2 )(6» 3 + 3tT 3 ) 

This reveals that our model is the hypersurface in P 4 whose ideal equals 

It = ( gooo<?m — goii?ioi9no ) 
If we set 0j = 1 — 37Tj then we get the additional constraint g 00 o = 1- □ 

The construction in this example generalizes to arbitrary trees T. There 
exists a change of coordinates, simultaneously on the parameter space (P 1 ) 7, 
and on the probability space P 4 " -1 , such that the map in (JHJ) becomes a 
monomial map in the new coordinates. This change of coordinates is known 
as the Fourier transform or as the Hadamard conjugation (see [2311101 123 f56j). 

We regard the Jukes-Cantor DNA model on a tree T with n leaves and r 
edges as an algebraic variety of dimension r in p 4 " _1 ) namely, it is the image 
of the map (JHJ . Its homogeneous prime ideal It is generated by differences of 
monomials q a — q b in the Fourier coordinates. In the phylogenetics literature 
(including the books [2nHH]) 5 the polynomials in the ideal It are known as 
phylogenetic invariants of the model. The following result was shown in |q"o] . 

Theorem 11. The ideal It which defines the Jukes-Cantor model on a binary 
tree T is generated by monomial differences q a — q b of degree at most three. 

It makes perfect sense to allow arbitrary distinct stochastic matrices P(t) 
on the edges of the tree T. The resulting model is the general Markov model 
on the tree T. Allman and Rhodes [U |3] determined the complete system of 
phylogenetic invariants for the general Markov model on a trivalent tree T. 

An important problem in phylogenomics is to identify the maximum like- 
lihood branch lengths, given a phylogenetic X-tree T, a rate matrix Q and 
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an alignment of sequences. For the Jukes-Cantor DNA model on three taxa, 
described in Example El the exact "analytic" solution of this optimization 
problem leads to an algebraic equation of degree 23. See [22, §6] for details. 

Let us instead consider the maximum likelihood estimation problem in 
the much simpler case of the Jukes-Cantor DNA model on two taxa. Here 
the tree T has only two leaves, labeled by X = {1,2}, directly branching off 
the root of T. The model is given by a surjective bilinear map 

. P i xP i ^ F \ ((# 1)Vri ),(fl 2 ,7r 2 )) h-> { Pl2 ,p di s). (9) 

The coordinates of the map are 

Pl2 = 0i6 2 + 37T17T2, 

Vdis = 36>i7T 2 + 36*271-! + 67Ti7r 2 . 

As before, we pass to affine coordinates by setting 6 1 , = 1 — 37Tj for i = 1,2. 

One crucial difference between the model (J£5} and Example 1101 is that the 
parameters in Q are not identifiable. Indeed, the inverse image of any point 
in P 1 under the map is a curve in P 1 x P 1 . Suppose we are given data 
consisting of two aligned DNA sequences of length n where k of the bases 
are different. The corresponding point in P 1 is u = (n — k, k). The inverse 
image of u under the map is the curve in the affine plane with the equation 

12n7r 1 7T2 — "in-Ki — 3n7r 2 + k = 0. 

Every point (tti, 7r 2 ) on this curve is an exact fit for the data u = (n — k,k). 
Hence this curve equals the set of all maximum likelihood parameters for this 
model and the given data. We rewrite the equation of the curve as follows: 

4k 

(1-4*0(1-4*2) = 1-—. (10) 
Recall from (J7J) that the branch length from the root to leaf i equals 

3aA = ---logdet^)) = --.log(l-47r0. 

By taking logarithms on both sides of (fTUj) . we see that the curve of all maxi- 
mum likelihood parameters becomes a line in the branch length coordinates: 

3«iti + 3a 2 t 2 = ---log(l-— ). (11) 
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The sum on the left hand side is the distance from leaf 1 to leaf 2 in the tree 
T. Our discussion of the two-taxa model leads to the following formula which 
known in evolutionary biology 26|j under the name Jukes-Cantor correction: 

Proposition 12. Given an alignment of two sequences of length n, with k 
differences between the bases, the ML estimate of the branch length equals 

S 12 = -?.log(l-|£). 

There has been recent progress on solving the likelihood equations exactly 
for small trees [THl HU E21 HZj . We believe that these results will be useful in 
designing new algorithms for computing maximum likelihood branch lengths, 
and to better understand the mathematical properties of existing methods 
(such as fastDNAml j^U]) which are widely used by computational biologists. 

It may also be the case that T is unknown, in which case the problem is 
not to select a point on a variety, but to select from (exponentially many) 
varieties. This problem is discussed in the next section. 

The evolutionary models discussed above do not allow for insertion and 
deletion events. They also assume that sites evolve independently. Although 
many widely used models are based on these assumptions, biological reality 
calls for models that include insertion and deletion events j3T], site inter- 
actions [SOj, and the flexibility to allow for genome dynamics such as rear- 
rangements. Interested mathematicians will find a cornucopia of fascinating 
research problems arising from such more refined evolutionary models. 

7 Phylogenetic Combinatorics 

Fix a set A of n taxa. A dissimilarity map on A is a function 5 : A x A —>■ K. 
such that S(x,x) = and S(x,y) = 5(y,x). The set of all dissimilarity maps 

on A is a real vector space of dimension (™) which we identify with Ir( 2 ). A 
dissimilarity map 5 is called a metric on X if the triangle inequality holds: 

8(x,z) < 5(x,y) + 5(y, z) for x, y, z e A. 

The set of all metrics on A is a full-dimensional convex polyhedral cone in 
Ir( 2 ), called the metric cone. Phylogenetic combinatorics is concerned with 
the study of certain subsets of the metric cone which are relevant for biology. 
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This field was pioneered in the 1980's by Andreas Dress and his collaborators; 
see Dress' 1998 ICM lecture jTH] and the references given there. 

Let T be a phylogenetic X-tree whose edges have specified lengths. These 
lengths can be arbitrary non-negative real numbers. The tree T defines a 
metric 5t on X as follows: St{x, y) equals the sum of the lengths of the 
edges on the unique path in T between the leaves labeled by x and y. 

The space of X -trees is the following subset of the metric cone: 

T x = { 5 T : T is a phylogenetic X-tree } C rGO. (12) 

Metric properties of the tree space T x and its statistical and biological sig- 
nificance were studied by Billera, Holmes and Vogtmann [9 j. The following 
classical Four Point Condition characterizes membership in the tree space: 

Theorem 13. A metric 5 on X lies in Tx if and only if, for any four taxa 
u, v, x, y G X , 5(u, v) + 8(x, y) < max{<5(-u, x) + 8(v, y), 8(u, y) + 8(v, x)}. 

We refer to the book for a proof of this theorem and several variants. 
To understand the structure of Tx, let us fix the combinatorial type of a 
trivalent tree T. The number of choices of such trees is the Schroder number 

(2n-5)!! = 1-3-5 (2n - 7) ■ (2n - 5). (13) 

Since X has cardinality n, the tree T has 2n — 3 edges, and each of these 
edges corresponds to a split (A, B) of the set X into two non-empty disjoint 
subsets A and B. Let Splits (T) denote the collection of all 2n — 3 splits 
(A, B) arising from T. 

Each split (A, B) defines a split metric o~(a,b) on X as follows: 

8(A,B)( x ,y) = if (x 6 A and y G A) or (x G B and y G B), 
5<a,b){x-i y) — 1 if (x e A and y G B) or (y G A and x G B). 

The vectors : (A, B) G Splits{T) } are linearly independent in IrW. 

Their non-negative span is a cone Ct isomorphic to the orthant M^ -3 . 

Proposition 14. The space Tx of all X-trees is the union of the (2n — 5)!! 
orthants Ct- It is hence a simplicial fan of pure dimension 2n — 3 in r( 2 ). 

The tree space Tx can be identified combinatorially with a simplicial 
complex of pure dimension 2n — 4, to be denoted Tx- The vertices of Tx are 
the 2 n - 1 - 1 splits of the set X. We say that two splits (A, B) and (A', B') 
are compatible if at least one of the four sets An A', A C\B', BnA' and BdB' 
is the empty set. Here is a combinatorial characterization of the tree space: 
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Proposition 15. A collection of splits of the set X forms a face in the 
simplicial complex Tx if and only if that collection is pairwise compatible. 

The phylogenetics problem is to reconstruct a tree T from n aligned se- 
quences. In principle, one can select from evolutionary models for all possible 
trees in order to find the maximum likelihood fit. Even if the maximum like- 
lihood problem can be solved for each individual tree, this approach becomes 
infeasible in practice when n increases, because of the combinatorial explo- 
sion in the number (|13|) of trees. A number of alternative approaches have 
been suggested that attempt to find evolutionary models which fit summaries 
of the data. They build on the characterizations of trees given above. 

Distance-based methods are based on the observation that trees can be en- 
coded by metrics satisfying the Four Point Condition fTheoremll3|). Starting 
from a multiple sequence alignment, one can produce a dissimilarity map on 
the set X of taxa by computing the maximum likelihood distance between 
every pair of taxa, using Proposition 1121 The resulting dissimilarity map S 
is typically not a tree metric, i.e., it does not actually lie in the tree space 
T x . What needs to be done is to replace 5 by a nearby tree metric 5t E T x . 

The method of choice for most biologists is the neighbor-joining algorithm, 
which provides an easy-to-compute map from the cone of all metrics onto Tx- 
The algorithm is based on the following "cherry-picking theorem" |4*o1 I54j : 

Theorem 16. Let 5 be a tree metric on X . For every pair i,j E X set 

Q s {i,j) = (n-2)-6(i,j) - Y,5(hk) - 5>0'.*)- ( 14 ) 

Then the pair x,y G X that minimizes Qs{x,y) is a cherry in the tree, i.e., 
x and y are separated by only one internal vertex z in the tree. 

Neighbor- joining works as follows. Starting from an arbitrary metric 8 on 
n taxa, one sets up the n x n-matrix Q$ whose (i,j)-entry is given by the 
formula ([14)1 . and one identifies the minimum off-diagonal entry Qs(x,y). If 
8 were a tree metric then the internal vertex z which separates the leaves x 
and y would have the following distance from any other leaf k in the tree: 

5(z,k) = ±(8(x,k) + 8(y,k)-8(x,y)). (15) 

One now removes the taxa x, y and replaces them by a new taxon z whose 
distance to the remaining n — 2 taxa is given by (|T5|) . This replaces the nxn 
matrix Qs by an (n — 1) x (n — 1) matrix, and one iterates the process. 
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This neighbor- joining algorithm recursively constructs a tree T whose 
metric 5t is reasonably close to the given metric 5. If 5 is a tree metric then 
the method is guaranteed to reconstruct the correct tree. More generally, in- 
stead of estimating pairwise distances, one can attempt to (more accurately) 
estimate the sum of the branch lengths of subtrees of size m > 3. 

We define an m-dissimilarity map on X to be a function 5 : X m — > R 
such that 5(ii,i 2 , ■ ■ ■ , i m ) = 5(i n (i), M2)j • ■ • , V(m)) f° r an permutations 7r 
on {l,...,m} and «2, • • • , «m) — if the taxa ii, i 2 , . . . , i m are not 
distinct. The set of all m-dissimilarity maps on X is a real vector space of 
dimension f n ) which we identify with r(™). Every X-tree T gives rise to an 
m-dissimilarity map 5t as follows. We define 5t(h, ■ ■ ■ , i m ) to be the sum of 
all branch lengths in the subtree of T spanned by i\, . . . , i m G X. 

The following theorem f7| is a generalization of Theorem El It 
leads to a generalized neighbor-joining algorithm which provides a better 
approximation of the maximum likelihood tree and parameters: 

Theorem 17. Let T be an X-tree and m < n = \X\. For any i,j G X set 
Q T (iJ) = (£=4) £ 5r(i,j,Y)- J2 *r(i,y)~ E 

Then the pair x, y G X that minimizes QT{x,y) is a cherry in the tree T. 

The subset of Rv™) consisting of all m-dissimilarity maps 5t arising from 
trees T is a polyhedral space which is the image of the tree space Tx under 
a piecewise-linear map r( 2 ) — > r(™). We do not know a simple characteriza- 
tion of this m-version of tree-space which extends the Four Point Condition. 

Here is another natural generalization of the space of trees. Fix an m- 
dissimilarity map 5 : X m — > R and consider any (m — 2)-element subset 
Y G (^2)- We get an induced dissimilarity map 8/y on X\Y by setting 

8/ Y (i,j) = 5(i,j,Y) foraUi,; e X\Y. 

We say that 5 is an m-tree if 8/y is a tree metric for all Y G ( m X 2 )- Thus, 
by Theorem El an m-dissimilarity map 5 on X is an m-tree if 

6(i,j,Y) + 6(k,l,Y) < m^{S(i,k,Y) + S(j,l,Y),S(i,l,Y)+S{k,j,Y)} 

for all Y G ( m ^ 2 ) and all i,j,k,l G 
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Let Gr m!n denote the subset of r(™) consisting of all m-trees. The space 
Gr myn is a polyhedral fan which is slightly larger than the tropical Grass- 
mannian studied in For every m-tree^ G Gr m ^ n there is an (m — 1)- 

dimensional tree-like space whose "leaves" are the taxa in X. This is the 
tropical linear space defined in [51 . This construction, which is described in 
[5*21 §6] and O §3.5], specializes to the construction of an X-tree T from 
its metric St when m = 2. The study of m-trees and the tropical Grassman- 
nian was anticipated in jTj*H |J20| - The Dress- Wenzel theory of matroids with 
coefficients [20] contains our m-trees as a special case. The space Gr m n of 
all m-trees is discussed in the context of buildings in [19,. Note that the tree 
space T x in (fT2*j) is precisely the tropical Grassmannian Gr 2 , n . 

It is an open problem to find a natural and easy-to-compute projection 
from Rw onto Gr m ^ n which generalizes the neighbor- joining method. Such 
a variant of neighbor-joining would be likely to have applications for more 
intricate biological data that are not easily explained by a tree model. We 
close this section by discussing an example. 

Example 18. Fix a set of six taxa, X = {1,2,3,4,5,6}, and let m = 3. 
The space of 3-dissimilarity maps on X is identified with R 20 . An element 
S G R 20 is a 3-tree if 8/i is a tree metric on X\{i} for all %. Equivalently, 

6(i,j, k) + S(i,l,m) < meoc{8(i,j,l)+S(i,k,m),S(i,j,m) + S(i,k,l)} 

for all i,j, k,l,m G X. The set of all 3-trees is a 10-dimensional poly- 

hedral fan. Each cone in this fan contains the 6-dimensional linear space L 
consisting of all 3-dissimilarity maps of the particular form 

S(i,j, k) = Ui + ujj + u) k for some u G R 6 . 

The quotient Gr^/L is a 4-dimensional fan in the 14- dimensional real vec- 
torspace R 20 /L. Let Gr^ denote the three-dimensional polyhedral complex 
obtained by intersecting Gr^^/L with a sphere around the origin in R 20 /L. 

It was shown in "5*2*1 §5] that Gr 3 & is a three-dimensional simplicial com- 
plex consisting of 65 vertices, 550 edges, 1395 triangles and 1035 tetrahedra. 
Each of the 1035 tetrahedra parameterizes six-tuples of tree metrics 

(5/i,5/ 2 , Vs, 5A, 5/5,V6), 

where the tree topologies on five taxa are fixed. The homology of the tropical 
Grassmannian Gr^ is concentrated in the top dimension and is free abelian: 

H 3 (Gr 3fi ,Z) = Z 126 . 
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If T is an X-tree and St the corresponding 3-dissimilarity map (as in 
Theorem IT7j) then it is easy to check that St lies in GV 3j6 . The set of all 
3-trees S = St has codimension one in Gr^^. It is the intersection of Gr$£ 
with the 15-dimensional linear subspace of K 20 defined by the equations 



5(123) + 5(145) + 5(246) + 5(356) 
5(123) + 5(145) + 5(346) + 5(256) 
5(123) + 5(245) + 5(146) + 5(356) 
5(123) + 5(345) + 5(246) + 5(156) 
5(123) + 5(345) + 5(146) + 5(256) 



= 5(124) + 5(135) +5(236) +5(456), 

= 5(134) + 5(125) +5(236) +5(456), 

= 5(124) + 5(235) + 5(136) + 5(456), 

= 5(234) + 5(135) + 5(126) + 5(456), 

= 5(134) + 5(235) + 5(126) + 5(456). 



Working modulo L and intersecting with a suitable sphere, the tree space T x 
is a two-dimensional simplicial complex, consisting of 105 = 5!! triangles. To 
be precise, the simplicial complex in Proposition is the join of this triangu- 
lated surface with the 5-simplex on X. Theorem El relates to the following 
geometric picture: the triangulated surface T x sits inside the triangulated 
threefold GV 36 , namely, as the solution set of the five equations. □ 



8 Back to the Data 

In Section 2, a conjecture was proposed based on our finding that the "mean- 
ing of life" sequence (JTJ is present (without mutations, insertions or deletions) 
in orthologous regions in ten vertebrate genomes. In this section we explain 
how the various ideas outlined throughout this paper can be used to estimate 
the probability that such an extraordinary degree of conservation would oc- 
cur by chance. The mechanics of the calculation also provide a glimpse into 
the types of processing and analyses that are performed in computational 
biology. Two research papers dealing with this subject matter are [7| l2*Tj. 

What we shall compute in this section is the probability under the Jukes- 
Cantor model that a single ancestral base that is not under selection (and is 
therefore free to mutate) is identical in the ten present day vertebrates. 

Step 1 (the genomes): The National Center for Biotechnology In- 
formation (NCBI - http://www.ncbi.nlm.nih.gov/) maintains a public 
database called GENBANK which contains all publicly available genome se- 
quences from around the world. Large sequencing centers that receive public 
funding are generally required to deposit raw sequences into this database 
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within 24 hours of processing by sequencing machines, and thus many au- 
tomatic pipelines have been set up for generating and depositing sequences. 
The growth in GENBANK has been spectacular. The database contained 
only 680, 000 base pairs when it was started in 1982, and this number went 
up to 49 million by 1990. There are currently 44 billion base pairs of DNA 
in GENBANK. 

The ten genomes of interest are not all complete, but are all downloadable 
from GENBANK, either in pieces mapped to chromosomes (e.g. for human) 
or as collections of subsequences called contigs (for less complete genomes). 

Step 2 (annotation): In order to answer our question we need to know 
where genes are in the genomes. Some genomes have annotations that were 
derived experimentally, but all the genomes are annotated using HMMs (Sec- 
tion 4) shortly after the release of the sequence. These annotations are per- 
formed by centers such as at UC Santa Cruz (http://genome.ucsc.edu/) 
as well as by individual authors of programs. It remains an open problem to 
accurately annotate genomes. But HMM programs are quite good on aver- 
age. For example, typically 98% of coding bases are predicted correctly to 
be in genes. On the other hand, boundaries of exons are often misannotated: 
current state of the art methods only achieve accuracies of about 80% 0. 

Step 3 (alignment): We start out by performing a genome alignment. 
Current methods for aligning whole genomes are all based, to varying de- 
grees, on the pair HMM ideas of Section 5. Although in practice it is not 
possible to align sequences containing billions or even millions of base pairs 
with hidden Markov models, pair HMMs are subroutines of more complex 
alignment strategies where smaller regions for alignment are initially identi- 
fied from the entire genomes by fast string matching algorithms ^U]- The 
ten vertebrate whole genome alignments which gave rise to Conjecture^ are 
accessible at http : / /bio . math . berkeley . edu/genomes/ 

Step 4 (finding neutral DNA): In order to compute the probability 
that a certain subsequence is conserved between genomes, it is necessary to 
estimate the neutral rate of evolution. This is done by estimating parameters 
for an evolutionary model of base pairs in the genome that are not under 
selection, and are therefore free to mutate. Since neutral regions are difficult 
to identify a-priori, commonly used surrogates are synonymous substitutions 
in codons (Section 3). Because synonymous substitutions do not change 
the amino acids, it is unlikely that they are selected for or against, and 
various studies have shown that such data provide good estimates for neutral 
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Table 3: Jukes-Cantor pairwise distance estimates. 



mutation rates. By searching through the annotations and alignments, we 
identified n = 14, 202 four-fold degenerate sites. These can be used for 
analyzing probabilities of neutral mutations. 

Step 5 (deriving a metric): We would ideally like to use maximum 
likelihood techniques to reconstruct a tree T with branch lengths from the 
alignments of the four-fold degenerate sites. One approach is to try to use a 
maximum-likelihood approach, but this is difficult to do reliably because of 
the complexity of the likelihood equations, even for the Jukes-Cantor models 
with \X\ = 10. An alternative approach is to estimate pairwise distances 
between species i, j using the formula in Proposition^! The resulting metric 
on the set X = {gg, hs, mm, pt, rn, cf, dr, tn, tr, xt} is given in Table 3. For 
example, the pairwise alignment between human and chicken (extracted from 
the multiple alignment) has n = 14202 positions, of which k = 7132 are 
different. Thus, the Jukes-Cantor distance between the genomes of human 
and chicken equals 

-^•logfi-a = 4-logf^) = 0.830536... 
4 6 V 3n/ 4 6 V 42606 / 

Step 6 (building a tree): From the pairwise distances in Table 3 we 
construct a phylogenetic X-tree using the neighbor joining algorithm (Section 
7) . The tree with the inferred branch lengths is shown in Figure 0] The tree 
is drawn such that the branch lengths are consistent with the horizontal 
distances in the diagram. The root of the tree was added manually in order 
to properly indicate the ancestral relationships between the species. 
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At this point we wish to add a philosophical remark: The tree in Figure 
E|is a point on an algebraic variety! Indeed, that variety is the Jukes-Cantor 
model (Proposition EJ, and the preimage coordinates 7Tj) of that point 
are obtained by exponentiating the branch lengths as described in Section 6. 
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Figure 4: Neighbor joining tree from alignment of codons in ten vertebrates. 



Step 7 (calculating the probability): We are now given a specific 
point on the variety representing the Jukes-Cantor model on the tree depicted 
in Figure |3J Recall from Proposition |U] that this variety, and hence our 
point, lives in a projective space of dimension 4 10 — 1 = 1,048,575. What 
we are interested in are four specific coordinates of that point, namely, the 
probabilities that the same nucleotide occurs in every species: 

Paaaaaaaaaa = Pcccccccccc = Pgggggggggg = Ptttttttttt (16) 

As discussed in Section 6, this expression is a multilinear polynomial in 
the edge parameters fa). When we evaluate it at the parameters derived 
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from the branch lengths in Figure 0] we find that 

Paaaaaaaaaa = 0.009651... 

Returning to the "meaning of life" sequence (JTJ, this implies the following 

Proposition 19. Assuming the probability distribution on Q 10 given by the 
Jukes-Cantor model on the tree in Figure^ the probability of observing a 
sequence of length 42 unchanged at a given location in the ten vertebrate 
genomes within a neutrally evolving region equals (0.038604) 42 = 4.3-10~ 60 . 

This calculation did not take into account the fact that the "meaning of 
life" sequence may occur in an arbitrary location of the genome in question. 
In order to adjust for this, we can multiply the number in Proposition EH 
by the length of the genomes. The human genome contains approximately 
2.8 billion nucleotides, so it is reasonable to conclude that the probability of 
observing a sequence of length 42 unchanged somewhere in the ten vertebrate 
genomes is approximately 

2.8 • 10 9 x 4.3 • 10~ 60 ~ 10~ 50 . 

This probability is a very small number, i.e., it is unlikely that the remarkable 
properties of the sequence (0) occurred by "chance" . Despite the shortcom- 
ings of the Jukes-Cantor model discussed at the end of Section 6, we believe 
that PropositionHniconstitutes a sound argument in support of Conjecture 1. 
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